[ENH] Add basic maxscore writer/reader #6825
HammadB wants to merge 10 commits into hammad/sparse_posting_block from
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
> **Warning:** This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.
This PR introduces a new sparse indexing/query path centered on … The PR also extracts reusable scoring primitives into …

*This summary was automatically generated by @propel-code-bot.*
```rust
/// Used by the suffix-rewrite optimization in `MaxScoreWriter::commit()`
/// to avoid loading blocks before the first affected offset.
pub async fn get_posting_blocks_from(
```

nit: this could be named something like `get_posting_blocks_range(start_seq, end_seq)`. The current name is a bit unclear.
```rust
if let Some(ref directory) = old_directory {
    let old_block_count = directory.num_blocks() as u32;
    let old_dir_part_count = if let Some(ref reader) = self.old_reader {
        reader.count_directory_parts(encoded_dim).await? as u32
```

Not sure how much of a perf concern this is: both `reader.count_directory_parts()` and `reader.get_directory()` do a `get_prefix` call, which could result in fetching blocks from S3. If the cache churns after the first call and before the second, the second will re-fetch the same blocks the first fetched, just for getting a count. Is it possible to combine these two into one call? Ignore this comment if it won't matter in practice.
```rust
#[derive(Clone)]
pub struct MaxScoreWriter<'me> {
    block_size: u32,
    delta: Arc<DashMap<u32, DashMap<u32, Option<f32>>>>,
```

This might also not matter much in practice, but `Arc<DashMap<u32, DashMap<u32, Option<f32>>>>` vs. `Arc<DashMap<u32, AsyncPartitionedMutex<Vec<(u32, Option<f32>)>>>>` might be friendlier to CPU caches. We don't need the inner hash map because different threads will be operating on different doc ids, so they can simply append to a vector.

> probably not a concern yet. The slow part is the commit.
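To make the suggested layout concrete, here is a minimal std-only sketch of the idea, with `Mutex` standing in for `AsyncPartitionedMutex` and `HashMap` for `DashMap` (the type and method names here are illustrative, not the PR's actual API): each dimension maps to an append-only vector, and the commit path sorts once per dimension instead of paying a hash lookup per entry.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative delta buffer: dimension -> append-only list of (doc_id, weight).
/// A `None` weight marks a delete, mirroring the `Option<f32>` in the PR.
struct DeltaBuffer {
    dims: Mutex<HashMap<u32, Vec<(u32, Option<f32>)>>>,
}

impl DeltaBuffer {
    fn new() -> Self {
        Self { dims: Mutex::new(HashMap::new()) }
    }

    /// Writers touching distinct doc ids just append; no per-doc hash lookup.
    fn push(&self, dim: u32, doc: u32, weight: Option<f32>) {
        self.dims.lock().unwrap().entry(dim).or_default().push((doc, weight));
    }

    /// Commit-time pass: one sort per dimension, sequential and cache-friendly.
    fn drain_sorted(&self, dim: u32) -> Vec<(u32, Option<f32>)> {
        let mut entries = self.dims.lock().unwrap().remove(&dim).unwrap_or_default();
        entries.sort_by_key(|&(doc, _)| doc);
        entries
    }
}

fn main() {
    let buf = DeltaBuffer::new();
    buf.push(7, 2, Some(1.0));
    buf.push(7, 0, Some(0.25));
    buf.push(7, 1, None); // delete doc 1 in dimension 7
    println!("{:?}", buf.drain_sorted(7));
    // prints [(0, Some(0.25)), (1, None), (2, Some(1.0))]
}
```

The trade-off is that duplicate doc ids must be deduplicated during the commit sort rather than being collapsed at insert time.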
```rust
    posting_reader: BlockfileReader<'me, u32, SparsePostingBlock>,
}

impl<'me> MaxScoreReader<'me> {
```

naming nit: would it be useful to segregate the getter (`get_*`) methods into owned vs. non-owned ones by naming them suitably?

> what are the non-owned ones?
```rust
let mut terms: Vec<TermState> = Vec::new();
for (idx, result) in cursor_results.into_iter().enumerate() {
    let Some(mut cursor) = result? else {
```

Why do we ignore the error here?
```rust
}
if mask.contains(doc) {
    let idx = (doc - window_start) as usize;
    bitmap[idx >> 6] |= 1u64 << (idx & 63);
```

A comment explaining this bitwise arithmetic would be useful for future readers.
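For reference, the idiom in the snippet packs one bit per window slot into `u64` words. A standalone sketch of the arithmetic (the helper names are mine, not the PR's):

```rust
/// Set bit `idx` in a flat u64 bitmap: `idx >> 6` picks the word
/// (64 = 2^6 slots per word) and `idx & 63` picks the bit inside it.
fn set_bit(bitmap: &mut [u64], idx: usize) {
    bitmap[idx >> 6] |= 1u64 << (idx & 63);
}

/// Check whether bit `idx` is set, using the same word/bit decomposition.
fn test_bit(bitmap: &[u64], idx: usize) -> bool {
    bitmap[idx >> 6] & (1u64 << (idx & 63)) != 0
}

fn main() {
    // A 4096-slot window needs 4096 / 64 = 64 words.
    let mut bitmap = vec![0u64; 64];
    set_bit(&mut bitmap, 70); // lands in word 1, bit 6
    assert!(test_bit(&bitmap, 70));
    assert!(!test_bit(&bitmap, 71));
    println!("word 1 = {:#x}", bitmap[1]); // prints word 1 = 0x40
}
```

`>> 6` and `& 63` are the power-of-two shortcuts for `/ 64` and `% 64`, which is why the word size must stay a power of two.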
Description of changes
This is PR #2 of a series unbundled from the `hammad/sparse-maxscore-prototype` branch. Stacked on `hammad/sparse_posting_block`, which adds `SparsePostingBlock` and its blockstore integration.

This PR adds the writer, reader, cursor, and eager-mode BlockMaxMaxScore query implementation for sparse posting lists. Subsequent PRs will add lazy I/O cursors, a 3-batch pipeline, SIMD budget pruning, and segment-level integration.
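Since the essential/non-essential term partitioning is the core idea behind MaxScore, here is a minimal sketch of that step, independent of this PR's types (the function name and signature are illustrative, not taken from the PR):

```rust
/// Given per-term score upper bounds sorted ascending and the current top-k
/// threshold, return the index of the first "essential" term. The prefix
/// [0..split) of non-essential terms has a combined upper bound that cannot
/// beat the threshold on its own, so documents appearing only in those
/// posting lists can be skipped entirely.
fn maxscore_split(upper_bounds: &[f32], threshold: f32) -> usize {
    let mut prefix_sum = 0.0f32;
    for (i, &ub) in upper_bounds.iter().enumerate() {
        prefix_sum += ub;
        if prefix_sum > threshold {
            return i; // terms [i..] must be iterated as candidate generators
        }
    }
    upper_bounds.len() // no document can exceed the threshold at all
}

fn main() {
    // Three query terms with upper bounds 0.5, 1.0, 2.0; current threshold 1.2.
    // 0.5 <= 1.2, but 0.5 + 1.0 > 1.2, so only term 0 is non-essential.
    let split = maxscore_split(&[0.5, 1.0, 2.0], 1.2);
    println!("{}", split); // prints 1
}
```

The block-max variant in this PR refines this per window: block-level maxima replace the global upper bounds, so the split point can move as the threshold rises during the query.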
- `BlockSparseWriter`: Accumulates per-dimension deltas in a `DashMap`, merges with an optional previous `BlockSparseReader` on commit, re-chunks entries into fixed-size blocks (default 1024), and writes both data blocks and a directory block per dimension. Supports incremental add/delete.
- `BlockSparseReader`: Reads posting blocks per dimension via the blockfile prefix API. Provides `open_cursor()` for single-dimension iteration and a full `query()` method implementing the windowed BlockMaxMaxScore algorithm.
- `PostingCursor`: Eager cursor backed by fully decompressed `SparsePostingBlock`s. Supports `advance()` with roaring bitmap masks, `drain_essential()` for Phase 1 window accumulation, `score_candidates()` for Phase 2 non-essential merge-join, and `window_upper_bound()` for per-window essential/non-essential repartitioning.
- `query()` (BlockMaxMaxScore): Windowed (4096-slot) block-max MaxScore implementation with essential/non-essential term partitioning, bitmap-tracked flat accumulator, budget-based candidate pruning, and min-heap top-k extraction.
- `SparseRescorer` trait and `rescore_and_select()` for oversampled retrieval with exact-score refinement. Not currently used.
- Design doc (`maxscore.md`): Documents the algorithm, data layout, and per-phase query walkthrough. High level, for posterity.

Test plan
11 integration test files covering the writer, reader, cursor, and query engine:
- `ms_01_blockfile_roundtrip`: Serialize/deserialize `SparsePostingBlock` through blockfile write→flush→read.
- `ms_02_writer_basic`: Write sparse vectors, commit, read back via reader.
- `ms_03_writer_incremental`: Incremental updates: add, delete, overwrite across commits.
- `ms_04_writer_edge_cases`: Single-entry dimensions, high-cardinality dimensions, empty deltas.
- `ms_05_cursor`: `PostingCursor` advance, drain_essential, score_candidates, window_upper_bound.
- `ms_06_correctness`: End-to-end `query()` results match brute-force dot-product.
- `ms_07_masks`: `query()` with Include/Exclude roaring bitmap masks.
- `ms_08_edge_cases`: Empty index, k=0, missing dimensions, single-doc queries.
- `ms_09_recall`: Recall@k measurement against brute-force on randomized data.
- `ms_10_incremental_query`: Query correctness after incremental writer updates.
- `ms_11_vectorized_scoring`: Verifies `drain_essential` and `score_candidates` accumulator arithmetic.

Tests pass locally with `cargo test`.

Migration plan
No migrations needed. This is a new index type with no existing on-disk data. Segment-level wiring is deferred to a later PR.
Observability plan
No new instrumentation in this PR. Tracing spans and metrics (block skip rate, essential/non-essential term counts, per-window candidate counts) will be added alongside segment integration in a follow-up PR.
Documentation Changes
No user-facing API changes. Internal design is documented in `rust/index/src/sparse/maxscore.md`.